How to speed up Python using Cython and Pythran#

Python is great, because it’s easy to program and a lot of advanced functionality is either build in or is available through modules. However, sometimes it is really just too slow.

Here we are going to look at how to use two extensions to Python, which compile python code to C or C++ to greatly speed up your python code.

Go to https://python4photonics.org/ to get the notebook

The test function#

To demonstrate the speed-ups one can achieve we are going to look at a typical DSP function: making symbol decisions.

The function requirements#

For each symbol in a (noisy) input signal array find the nearest symbol from a given alphabet, using eucledian distance.

Input Parameters:

  • input signal array

  • input alphabet array

Output parameters:

  • output signal with the same shape as the input signal composed of symbols from the alphabet

Timing#

We are going to use the timeit magic function in jupyter for timing our efforst, in more complex code you might want to use a profiler instead

Generate test data#

Lets start of with some easy test data so we now that everything works

# import the base modules we will use

%pylab inline
import numpy as np
from qampy import signals, impairments # convenient way of quickly generating a signals and symbol arrays
%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib
# generate a signal array with some added noise
qpsk = signals.SignalQAMGrayCoded(4, 10, nmodes=1) # generate a single polarization QPSK signal with 100 samples)
qpsk = impairments.change_snr(qpsk, 15) # change the SNR of the signal to 15dB
plt.figure(figsize=(10,10))
plt.plot(qpsk[0].real, qpsk[0].imag, '.') # plot the constellation
plt.plot(qpsk.coded_symbols.real, qpsk.coded_symbols.imag, 'x', ms=10) # plot the alphabet on top
[<matplotlib.lines.Line2D at 0x72c918bd9060>]
_images/081678eba59e73e56ba48b7db6895bfc887240864d54dfad735938b5522ee2d1.png

Naive Python implementation#

Let us start of by implementing a naive Python implementation

def make_decision_py(signal, alphabet):
    out = np.zeros_like(signal)
    for i in range(signal.size): # we only do 1D here
        dmin = 1000
        for j in range(alphabet.size):
            d = abs(signal[i]-alphabet[j])
            if d < dmin:
                idx = j
                dmin = d
        out[i] = alphabet[idx]
    return out

Test that it works

out = make_decision_py(qpsk[0], qpsk.coded_symbols)
print(out)
print(qpsk.symbols[0]) # these are the transmitted symbols
print(qpsk.symbols[0]==out)
[-0.70710678-0.70710678j  0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
 -0.70710678-0.70710678j]
[-0.70710678-0.70710678j  0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
 -0.70710678-0.70710678j]
[ True  True  True  True  True  True  True  True  True  True]

Time the code#

Lets do some timing of the code

%timeit make_decision_py(qpsk[0], qpsk.coded_symbols)
12.1 µs ± 470 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

That is still reasonably fast, but it is a very short array and only QPSK let us have a look at some more realistic data

# generate a signal array with some added noise
qam64 = signals.SignalQAMGrayCoded(64, 10**4, nmodes=1) # generate a single polarization 128-QAM signal with 10 000 samples)
qam64 = impairments.change_snr(qam64, 30) # change the SNR of the signal to 30dB
plt.figure(figsize=(10,10))
plt.plot(qam64[0].real, qam64[0].imag, '.') # plot the constellation
plt.plot(qam64.coded_symbols.real, qam64.coded_symbols.imag, 'x', ms=10) # plot the alphabet on top
[<matplotlib.lines.Line2D at 0x72c9161c3f70>]
_images/70c5609e84ff14e55b599625db99d543a7d9a7648a8222f0c96e5c12759d602a.png
%timeit make_decision_py(qam64[0], qam64.coded_symbols)
105 ms ± 832 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is a lot slower and still quite a short array!

Numpy Implementation#

Generally you should always use the numpy vector functions if you can because they will give you a significant speedup. So let us do that

def make_decision_np(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out
out = make_decision_np(qpsk[0], qpsk.coded_symbols)
print(out)
print(qpsk.symbols[0]) # these are the transmitted symbols
print(qpsk.symbols[0]==out)
[-0.70710678-0.70710678j  0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
 -0.70710678-0.70710678j]
[-0.70710678-0.70710678j  0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678-0.70710678j  0.70710678+0.70710678j
  0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
 -0.70710678-0.70710678j]
[ True  True  True  True  True  True  True  True  True  True]
%timeit make_decision_np(qam64[0], qam64.coded_symbols)
1.27 ms ± 7.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That is a factor of 20 faster!

Speeding up#

OK let’s try to speed up this code some more

I will use two packages here

Cython

Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language. It essentially allows you to write modules in an extended Python dialect that can be compile directly to C and provide significant speed-ups by avoiding most of the Python overheads. Cython has been around for a long time, so it is realtively mature. It is also great for wrapping C-libraries for use with Python.

Pythran

Pythran is an ahead of time compiler for a subset of the Python language, with a focus on scientific computing. It takes a Python module annotated with a few interface description and turns it into a native Python module with the same interface, but (hopefully) faster. Pythran is much younger than Cython, but has been progressing at an impressive pace.

Both Pythran and Cython can be used directly inside jupyter notebooks which we will use here. However, they are really aimed at being used for compiling modules and that is how I usually use them.

Cython#

Lets start with Python. Our starting point is the Numpy code above

import cython # the cython module
%load_ext cython
%%cython 
import numpy as np
cimport numpy as np
def make_decision_cy(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out
%timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.24 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

OK that was rather dissappointing no speed-up compared to the numpy code. However if we read the Cython documentation we find out that we really should be annotating the code to get significant speed-ups.

The first thing we can do is turn off several of the checks that Python usually does

%%cython 
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out
%timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.25 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So again no increase in speed, but considering the code is compiled what if we add some compiler flags?

%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out
%timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.25 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Again no speed up. But we haven’t told cython what numeric types we are dealing with so let us add some type annotations

%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet): # two array type annotations
    cdef double complex[:] out # this is a memoryview which essentially tells cython this is a block of memory similar to the ndarray annotations above
    cdef double[:,:] d
    cdef long[:] idx
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.25 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Again no speed improvement. What is going on?

Let us have a look at the code. Cython provides the --annotate argument, which tells us how the code is compiled

%%cython --annotate
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet): # two array type annotations
    cdef double complex[:] out # this is a memoryview which essentially tells cython this is a block of memory similar to the ndarray annotations above
    cdef double[:,:] d
    cdef long[:] idx
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return np.array(out)
Cython: _cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a.pyx

Generated by Cython 3.0.8

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

+01: import numpy as np
  __pyx_t_7 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_7) < 0) __PYX_ERR(1, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
/* … */
  __pyx_t_7 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_7) < 0) __PYX_ERR(1, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
 02: cimport numpy as np
 03: import cython
 04: 
+05: @cython.boundscheck(False) # won't check that index is in bounds of array
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy = {"make_decision_cy", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  PyArrayObject *__pyx_v_signal = 0;
  PyArrayObject *__pyx_v_alphabet = 0;
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("make_decision_cy (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args); if (unlikely(__pyx_nargs < 0)) return NULL;
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_signal,&__pyx_n_s_alphabet,0};
  PyObject* values[2] = {0,0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_signal)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(1, 5, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (likely((values[1] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_alphabet)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[1]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(1, 5, __pyx_L3_error)
        else {
          __Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, 1); __PYX_ERR(1, 5, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "make_decision_cy") < 0)) __PYX_ERR(1, 5, __pyx_L3_error)
      }
    } else if (unlikely(__pyx_nargs != 2)) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
      values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
    }
    __pyx_v_signal = ((PyArrayObject *)values[0]);
    __pyx_v_alphabet = ((PyArrayObject *)values[1]);
  }
  goto __pyx_L6_skip;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, __pyx_nargs); __PYX_ERR(1, 5, __pyx_L3_error)
  __pyx_L6_skip:;
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_AddTraceback("_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  if (unlikely(!__Pyx_ArgTypeTest(((PyObject *)__pyx_v_signal), __pyx_ptype_5numpy_ndarray, 1, "signal", 0))) __PYX_ERR(1, 9, __pyx_L1_error)
  if (unlikely(!__Pyx_ArgTypeTest(((PyObject *)__pyx_v_alphabet), __pyx_ptype_5numpy_ndarray, 1, "alphabet", 0))) __PYX_ERR(1, 9, __pyx_L1_error)
  __pyx_r = __pyx_pf_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_make_decision_cy(__pyx_self, __pyx_v_signal, __pyx_v_alphabet);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  goto __pyx_L0;
  __pyx_L1_error:;
  __pyx_r = NULL;
  __pyx_L0:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_make_decision_cy(CYTHON_UNUSED PyObject *__pyx_self, PyArrayObject *__pyx_v_signal, PyArrayObject *__pyx_v_alphabet) {
  __Pyx_memviewslice __pyx_v_out = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_d = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_idx = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_LocalBuf_ND __pyx_pybuffernd_alphabet;
  __Pyx_Buffer __pyx_pybuffer_alphabet;
  __Pyx_LocalBuf_ND __pyx_pybuffernd_signal;
  __Pyx_Buffer __pyx_pybuffer_signal;
  PyObject *__pyx_r = NULL;
  __pyx_pybuffer_signal.pybuffer.buf = NULL;
  __pyx_pybuffer_signal.refcount = 0;
  __pyx_pybuffernd_signal.data = NULL;
  __pyx_pybuffernd_signal.rcbuffer = &__pyx_pybuffer_signal;
  __pyx_pybuffer_alphabet.pybuffer.buf = NULL;
  __pyx_pybuffer_alphabet.refcount = 0;
  __pyx_pybuffernd_alphabet.data = NULL;
  __pyx_pybuffernd_alphabet.rcbuffer = &__pyx_pybuffer_alphabet;
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_signal.rcbuffer->pybuffer, (PyObject*)__pyx_v_signal, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) __PYX_ERR(1, 5, __pyx_L1_error)
  }
  __pyx_pybuffernd_signal.diminfo[0].strides = __pyx_pybuffernd_signal.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_signal.diminfo[0].shape = __pyx_pybuffernd_signal.rcbuffer->pybuffer.shape[0];
  {
    __Pyx_BufFmt_StackElem __pyx_stack[1];
    if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_alphabet.rcbuffer->pybuffer, (PyObject*)__pyx_v_alphabet, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) __PYX_ERR(1, 5, __pyx_L1_error)
  }
  __pyx_pybuffernd_alphabet.diminfo[0].strides = __pyx_pybuffernd_alphabet.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_alphabet.diminfo[0].shape = __pyx_pybuffernd_alphabet.rcbuffer->pybuffer.shape[0];
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_3);
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_5, 1);
  __Pyx_XDECREF(__pyx_t_6);
  __Pyx_XDECREF(__pyx_t_7);
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_8, 1);
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_9, 1);
  { PyObject *__pyx_type, *__pyx_value, *__pyx_tb;
    __Pyx_PyThreadState_declare
    __Pyx_PyThreadState_assign
    __Pyx_ErrFetch(&__pyx_type, &__pyx_value, &__pyx_tb);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_alphabet.rcbuffer->pybuffer);
    __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_signal.rcbuffer->pybuffer);
  __Pyx_ErrRestore(__pyx_type, __pyx_value, __pyx_tb);}
  __Pyx_AddTraceback("_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  goto __pyx_L2;
  __pyx_L0:;
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_alphabet.rcbuffer->pybuffer);
  __Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_signal.rcbuffer->pybuffer);
  __pyx_L2:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_out, 1);
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_d, 1);
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_idx, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__22 = PyTuple_Pack(5, __pyx_n_s_signal, __pyx_n_s_alphabet, __pyx_n_s_out, __pyx_n_s_d, __pyx_n_s_idx); if (unlikely(!__pyx_tuple__22)) __PYX_ERR(1, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__22);
  __Pyx_GIVEREF(__pyx_tuple__22);
/* … */
  __pyx_t_7 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy, 0, __pyx_n_s_make_decision_cy, NULL, __pyx_n_s_cython_magic_7806802036a512ad86, __pyx_d, ((PyObject *)__pyx_codeobj__23)); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 5, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_make_decision_cy, __pyx_t_7) < 0) __PYX_ERR(1, 5, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
 06: @cython.wraparound(False) # array[-1] won't work
 07: @cython.nonecheck(False) # variables are never set to None
 08: @cython.cdivision(True) # don't protect against dividing by zero
 09: def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet): # two array type annotations
 10:     cdef double complex[:] out # this is a memoryview which essentially tells cython this is a block of memory similar to the ndarray annotations above
 11:     cdef double[:,:] d
 12:     cdef long[:] idx
+13:     out = np.zeros_like(signal)
  __Pyx_GetModuleGlobalName(__pyx_t_2, __pyx_n_s_np); if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 13, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_zeros_like); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 13, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_t_2 = NULL;
  __pyx_t_4 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_3))) {
    __pyx_t_2 = PyMethod_GET_SELF(__pyx_t_3);
    if (likely(__pyx_t_2)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3);
      __Pyx_INCREF(__pyx_t_2);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_3, function);
      __pyx_t_4 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_2, ((PyObject *)__pyx_v_signal)};
    __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_4, 1+__pyx_t_4);
    __Pyx_XDECREF(__pyx_t_2); __pyx_t_2 = 0;
    if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 13, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  }
  __pyx_t_5 = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_5.memview)) __PYX_ERR(1, 13, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_v_out = __pyx_t_5;
  __pyx_t_5.memview = NULL;
  __pyx_t_5.data = NULL;
+14:     d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
  __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_abs); if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_6 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_newaxis); if (unlikely(!__pyx_t_6)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_6);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = PyTuple_New(2); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_GIVEREF(__pyx_t_6);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 0, __pyx_t_6)) __PYX_ERR(1, 14, __pyx_L1_error);
  __Pyx_INCREF(__pyx_slice__5);
  __Pyx_GIVEREF(__pyx_slice__5);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 1, __pyx_slice__5)) __PYX_ERR(1, 14, __pyx_L1_error);
  __pyx_t_6 = 0;
  __pyx_t_6 = __Pyx_PyObject_GetItem(((PyObject *)__pyx_v_signal), __pyx_t_3); if (unlikely(!__pyx_t_6)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_6);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __pyx_t_7 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_newaxis); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = PyTuple_New(2); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_INCREF(__pyx_slice__5);
  __Pyx_GIVEREF(__pyx_slice__5);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 0, __pyx_slice__5)) __PYX_ERR(1, 14, __pyx_L1_error);
  __Pyx_GIVEREF(__pyx_t_7);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 1, __pyx_t_7)) __PYX_ERR(1, 14, __pyx_L1_error);
  __pyx_t_7 = 0;
  __pyx_t_7 = __Pyx_PyObject_GetItem(((PyObject *)__pyx_v_alphabet), __pyx_t_3); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __pyx_t_3 = PyNumber_Subtract(__pyx_t_6, __pyx_t_7); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
  __pyx_t_7 = NULL;
  __pyx_t_4 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_2))) {
    __pyx_t_7 = PyMethod_GET_SELF(__pyx_t_2);
    if (likely(__pyx_t_7)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_2);
      __Pyx_INCREF(__pyx_t_7);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_2, function);
      __pyx_t_4 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_7, __pyx_t_3};
    __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_2, __pyx_callargs+1-__pyx_t_4, 1+__pyx_t_4);
    __Pyx_XDECREF(__pyx_t_7); __pyx_t_7 = 0;
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
    if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 14, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  }
  __pyx_t_8 = __Pyx_PyObject_to_MemoryviewSlice_dsds_double(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_8.memview)) __PYX_ERR(1, 14, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_v_d = __pyx_t_8;
  __pyx_t_8.memview = NULL;
  __pyx_t_8.data = NULL;
+15:     idx = np.argmin(d, axis=0)
  __Pyx_GetModuleGlobalName(__pyx_t_1, __pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_1, __pyx_n_s_argmin); if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 15, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_1 = __pyx_memoryview_fromslice(__pyx_v_d, 2, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_3 = PyTuple_New(1); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 15, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_GIVEREF(__pyx_t_1);
  if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 0, __pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error);
  __pyx_t_1 = 0;
  __pyx_t_1 = __Pyx_PyDict_NewPresized(1); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_t_1, __pyx_n_s_axis, __pyx_int_0) < 0) __PYX_ERR(1, 15, __pyx_L1_error)
  __pyx_t_7 = __Pyx_PyObject_Call(__pyx_t_2, __pyx_t_3, __pyx_t_1); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 15, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_9 = __Pyx_PyObject_to_MemoryviewSlice_ds_long(__pyx_t_7, PyBUF_WRITABLE); if (unlikely(!__pyx_t_9.memview)) __PYX_ERR(1, 15, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
  __pyx_v_idx = __pyx_t_9;
  __pyx_t_9.memview = NULL;
  __pyx_t_9.data = NULL;
+16:     out = alphabet[idx]
  __pyx_t_7 = __pyx_memoryview_fromslice(__pyx_v_idx, 1, (PyObject *(*)(char *)) __pyx_memview_get_long, (int (*)(char *, PyObject *)) __pyx_memview_set_long, 0);; if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 16, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __pyx_t_1 = __Pyx_PyObject_GetItem(((PyObject *)__pyx_v_alphabet), __pyx_t_7); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 16, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
  __pyx_t_5 = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_5.memview)) __PYX_ERR(1, 16, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_out, 1);
  __pyx_v_out = __pyx_t_5;
  __pyx_t_5.memview = NULL;
  __pyx_t_5.data = NULL;
+17:     return np.array(out)
  __Pyx_XDECREF(__pyx_r);
  __Pyx_GetModuleGlobalName(__pyx_t_7, __pyx_n_s_np); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 17, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_7, __pyx_n_s_array); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 17, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_3);
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
  __pyx_t_7 = __pyx_memoryview_fromslice(__pyx_v_out, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 17, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  __pyx_t_2 = NULL;
  __pyx_t_4 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_3))) {
    __pyx_t_2 = PyMethod_GET_SELF(__pyx_t_3);
    if (likely(__pyx_t_2)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3);
      __Pyx_INCREF(__pyx_t_2);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_3, function);
      __pyx_t_4 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_2, __pyx_t_7};
    __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_4, 1+__pyx_t_4);
    __Pyx_XDECREF(__pyx_t_2); __pyx_t_2 = 0;
    __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 17, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
  }
  __pyx_r = __pyx_t_1;
  __pyx_t_1 = 0;
  goto __pyx_L0;

Yellow above indicates access to Python internals (which are slow). So there is our reason, we are still just using python (numpy) functions. So to speed this up, we have to exchange the numpy functions with C functions. Maybe it’s better if we start from the Python function?

%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
#-c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet):
    cdef double complex[:] out
    cdef int i, j, idx, N, M
    cdef double d, dmin
    N = signal.size
    M = alphabet.size
    out = np.zeros_like(signal)
    for i in range(N):
        dmin = 1000.
        idx = 0
        for j in range(M):
            d = abs(signal[i]-alphabet[j])
            if d < dmin:
                idx = j
                dmin = d
        out[i] = idx
    return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.46 ms ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Finally a speed improvement! Not a lot though. The reason is that we are still calling a Python function inside the loop (abs). Let us use the C-function instead

%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
cimport numpy as np
cimport cython

cdef extern from "complex.h" nogil:
    double cabs(double complex)

    
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
    cdef double complex[:] out
    cdef int i, j, idx, N, M
    cdef double d, dmin
    N = signal.size
    M = alphabet.size
    out = np.zeros_like(signal)
    for i in range(N):
        dmin = 1000.
        idx = 0
        for j in range(M):
            d = cabs(signal[i]-alphabet[j])
            if d < dmin:
                idx = j
                dmin = d
        out[i] = idx
    return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.44 ms ± 773 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

OK now we have an interesting speed improvement another factor >10 compared to the numpy code. We can actually get even more by using parallel processing. Cython loops that do not use python code can release the GIL

%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp --link-args=-fopenmp
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange

cdef extern from "complex.h" nogil:
    double cabs(double complex)

    
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
    cdef double complex[:] out
    cdef int i, j, idx, N, M
    cdef double d, dmin
    N = signal.size
    M = alphabet.size
    out = np.zeros_like(signal)
    for i in prange(N, nogil=True):
        dmin = 1000.
        idx = 0
        for j in range(M):
            d = cabs(signal[i]-alphabet[j])
            if d < dmin:
                idx = j
                dmin = d
        out[i] = idx
    return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
374 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Another speed up almost a factor of 100 compared to the numpy code! Let us check what the annotated code looks like now

%%cython --annotate
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange

cdef extern from "complex.h" nogil:
    double cabs(double complex)

    
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
    cdef double complex[:] out
    cdef int i, j, idx, N, M
    cdef double d, dmin
    N = signal.size
    M = alphabet.size
    out = np.zeros_like(signal)
    for i in prange(N, nogil=True):
        dmin = 1000.
        idx = 0
        for j in range(M):
            d = cabs(signal[i]-alphabet[j])
            if d < dmin:
                idx = j
                dmin = d
        out[i] = idx
    return np.array(out)
Cython: _cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13.pyx

Generated by Cython 3.0.8

Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.

+01: import numpy as np
  __pyx_t_7 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_7) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
/* … */
  __pyx_t_7 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_7) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
 02: cimport numpy as np
 03: cimport cython
 04: from cython.parallel import prange
 05: 
 06: cdef extern from "complex.h" nogil:
 07:     double cabs(double complex)
 08: 
 09: 
+10: @cython.boundscheck(False) # won't check that index is in bounds of array
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy = {"make_decision_cy", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy(PyObject *__pyx_self, 
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
  __Pyx_memviewslice __pyx_v_signal = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_alphabet = { 0, 0, { 0 }, { 0 }, { 0 } };
  #if !CYTHON_METH_FASTCALL
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  #endif
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("make_decision_cy (wrapper)", 0);
  #if !CYTHON_METH_FASTCALL
  #if CYTHON_ASSUME_SAFE_MACROS
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args); if (unlikely(__pyx_nargs < 0)) return NULL;
  #endif
  #endif
  __pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
  {
    PyObject **__pyx_pyargnames[] = {&__pyx_n_s_signal,&__pyx_n_s_alphabet,0};
  PyObject* values[2] = {0,0};
    if (__pyx_kwds) {
      Py_ssize_t kw_args;
      switch (__pyx_nargs) {
        case  2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
        CYTHON_FALLTHROUGH;
        case  1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
        CYTHON_FALLTHROUGH;
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
      switch (__pyx_nargs) {
        case  0:
        if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_signal)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 10, __pyx_L3_error)
        else goto __pyx_L5_argtuple_error;
        CYTHON_FALLTHROUGH;
        case  1:
        if (likely((values[1] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_alphabet)) != 0)) {
          (void)__Pyx_Arg_NewRef_FASTCALL(values[1]);
          kw_args--;
        }
        else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 10, __pyx_L3_error)
        else {
          __Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, 1); __PYX_ERR(0, 10, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        const Py_ssize_t kwd_pos_args = __pyx_nargs;
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "make_decision_cy") < 0)) __PYX_ERR(0, 10, __pyx_L3_error)
      }
    } else if (unlikely(__pyx_nargs != 2)) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
      values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
    }
    __pyx_v_signal = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(values[0], PyBUF_WRITABLE); if (unlikely(!__pyx_v_signal.memview)) __PYX_ERR(0, 14, __pyx_L3_error)
    __pyx_v_alphabet = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(values[1], PyBUF_WRITABLE); if (unlikely(!__pyx_v_alphabet.memview)) __PYX_ERR(0, 14, __pyx_L3_error)
  }
  goto __pyx_L6_skip;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, __pyx_nargs); __PYX_ERR(0, 10, __pyx_L3_error)
  __pyx_L6_skip:;
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L3_error:;
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_signal, 1);
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_alphabet, 1);
  __Pyx_AddTraceback("_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_make_decision_cy(__pyx_self, __pyx_v_signal, __pyx_v_alphabet);
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;

  /* function exit code */
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_signal, 1);
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_alphabet, 1);
  {
    Py_ssize_t __pyx_temp;
    for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
      __Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
    }
  }
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_make_decision_cy(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_signal, __Pyx_memviewslice __pyx_v_alphabet) {
  __Pyx_memviewslice __pyx_v_out = { 0, 0, { 0 }, { 0 }, { 0 } };
  int __pyx_v_i;
  int __pyx_v_j;
  int __pyx_v_idx;
  CYTHON_UNUSED int __pyx_v_N;
  int __pyx_v_M;
  double __pyx_v_d;
  double __pyx_v_dmin;
  PyObject *__pyx_r = NULL;
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __PYX_XCLEAR_MEMVIEW(&__pyx_t_6, 1);
  __Pyx_AddTraceback("_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XCLEAR_MEMVIEW(&__pyx_v_out, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple__22 = PyTuple_Pack(10, __pyx_n_s_signal, __pyx_n_s_alphabet, __pyx_n_s_out, __pyx_n_s_i, __pyx_n_s_j, __pyx_n_s_idx, __pyx_n_s_N, __pyx_n_s_M, __pyx_n_s_d, __pyx_n_s_dmin); if (unlikely(!__pyx_tuple__22)) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple__22);
  __Pyx_GIVEREF(__pyx_tuple__22);
/* … */
  __pyx_t_7 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy, 0, __pyx_n_s_make_decision_cy, NULL, __pyx_n_s_cython_magic_31fd376cc9d559b735, __pyx_d, ((PyObject *)__pyx_codeobj__23)); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_7);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_make_decision_cy, __pyx_t_7) < 0) __PYX_ERR(0, 10, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
 11: @cython.wraparound(False) # array[-1] won't work
 12: @cython.nonecheck(False) # variables are never set to None
 13: @cython.cdivision(True) # don't protect against dividing by zero
 14: def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
 15:     cdef double complex[:] out
 16:     cdef int i, j, idx, N, M
 17:     cdef double d, dmin
+18:     N = signal.size
  __pyx_t_1 = __pyx_memoryview_fromslice(__pyx_v_signal, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 18, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_1, __pyx_n_s_size); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 18, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_t_3 = __Pyx_PyInt_As_int(__pyx_t_2); if (unlikely((__pyx_t_3 == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 18, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_v_N = __pyx_t_3;
+19:     M = alphabet.size
  __pyx_t_2 = __pyx_memoryview_fromslice(__pyx_v_alphabet, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 19, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_1 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_size); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 19, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_t_3 = __Pyx_PyInt_As_int(__pyx_t_1); if (unlikely((__pyx_t_3 == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 19, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_v_M = __pyx_t_3;
+20:     out = np.zeros_like(signal)
  __Pyx_GetModuleGlobalName(__pyx_t_2, __pyx_n_s_np); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 20, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_4 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_zeros_like); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 20, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  __pyx_t_2 = __pyx_memoryview_fromslice(__pyx_v_signal, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 20, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __pyx_t_5 = NULL;
  __pyx_t_3 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_4))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_4);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_4, function);
      __pyx_t_3 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_2};
    __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_4, __pyx_callargs+1-__pyx_t_3, 1+__pyx_t_3);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
    if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 20, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  }
  __pyx_t_6 = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_6.memview)) __PYX_ERR(0, 20, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_v_out = __pyx_t_6;
  __pyx_t_6.memview = NULL;
  __pyx_t_6.data = NULL;
+21:     for i in prange(N, nogil=True):
  {
      #ifdef WITH_THREAD
      PyThreadState *_save;
      _save = NULL;
      Py_UNBLOCK_THREADS
      __Pyx_FastGIL_Remember();
      #endif
      /*try:*/ {
        __pyx_t_3 = __pyx_v_N;
        {
            #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
                #undef likely
                #undef unlikely
                #define likely(x)   (x)
                #define unlikely(x) (x)
            #endif
            __pyx_t_8 = (__pyx_t_3 - 0 + 1 - 1/abs(1)) / 1;
            if (__pyx_t_8 > 0)
            {
                #ifdef _OPENMP
                #pragma omp parallel
                #endif /* _OPENMP */
                {
                    #ifdef _OPENMP
                    #pragma omp for lastprivate(__pyx_v_d) lastprivate(__pyx_v_dmin) firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_idx) lastprivate(__pyx_v_j)
                    #endif /* _OPENMP */
                    for (__pyx_t_7 = 0; __pyx_t_7 < __pyx_t_8; __pyx_t_7++){
                        {
                            __pyx_v_i = (int)(0 + 1 * __pyx_t_7);
                            /* Initialize private variables to invalid values */
                            __pyx_v_d = ((double)__PYX_NAN());
                            __pyx_v_dmin = ((double)__PYX_NAN());
                            __pyx_v_idx = ((int)0xbad0bad0);
                            __pyx_v_j = ((int)0xbad0bad0);
/* … */
      /*finally:*/ {
        /*normal exit:*/{
          #ifdef WITH_THREAD
          __Pyx_FastGIL_Forget();
          Py_BLOCK_THREADS
          #endif
          goto __pyx_L5;
        }
        __pyx_L5:;
      }
  }
+22:         dmin = 1000.
                            __pyx_v_dmin = 1000.;
+23:         idx = 0
                            __pyx_v_idx = 0;
+24:         for j in range(M):
                            __pyx_t_9 = __pyx_v_M;
                            __pyx_t_10 = __pyx_t_9;
                            for (__pyx_t_11 = 0; __pyx_t_11 < __pyx_t_10; __pyx_t_11+=1) {
                              __pyx_v_j = __pyx_t_11;
+25:             d = cabs(signal[i]-alphabet[j])
                              __pyx_t_12 = __pyx_v_i;
                              __pyx_t_13 = __pyx_v_j;
                              __pyx_v_d = cabs(__Pyx_c_diff_double((*((__pyx_t_double_complex *) ( /* dim=0 */ (__pyx_v_signal.data + __pyx_t_12 * __pyx_v_signal.strides[0]) ))), (*((__pyx_t_double_complex *) ( /* dim=0 */ (__pyx_v_alphabet.data + __pyx_t_13 * __pyx_v_alphabet.strides[0]) )))));
+26:             if d < dmin:
                              __pyx_t_14 = (__pyx_v_d < __pyx_v_dmin);
                              if (__pyx_t_14) {
/* … */
                              }
                            }
+27:                 idx = j
                                __pyx_v_idx = __pyx_v_j;
+28:                 dmin = d
                                __pyx_v_dmin = __pyx_v_d;
+29:         out[i] = idx
                            __pyx_t_13 = __pyx_v_i;
                            *((__pyx_t_double_complex *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_13 * __pyx_v_out.strides[0]) )) = __pyx_t_double_complex_from_parts(__pyx_v_idx, 0);
                        }
                    }
                }
            }
        }
        #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
            #undef likely
            #undef unlikely
            #define likely(x)   __builtin_expect(!!(x), 1)
            #define unlikely(x) __builtin_expect(!!(x), 0)
        #endif
      }
+30:     return np.array(out)
  __Pyx_XDECREF(__pyx_r);
  __Pyx_GetModuleGlobalName(__pyx_t_4, __pyx_n_s_np); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 30, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_4, __pyx_n_s_array); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 30, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __pyx_t_4 = __pyx_memoryview_fromslice(__pyx_v_out, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 30, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_5 = NULL;
  __pyx_t_8 = 0;
  #if CYTHON_UNPACK_METHODS
  if (unlikely(PyMethod_Check(__pyx_t_2))) {
    __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_2);
    if (likely(__pyx_t_5)) {
      PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_2);
      __Pyx_INCREF(__pyx_t_5);
      __Pyx_INCREF(function);
      __Pyx_DECREF_SET(__pyx_t_2, function);
      __pyx_t_8 = 1;
    }
  }
  #endif
  {
    PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_4};
    __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_2, __pyx_callargs+1-__pyx_t_8, 1+__pyx_t_8);
    __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0;
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 30, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  }
  __pyx_r = __pyx_t_1;
  __pyx_t_1 = 0;
  goto __pyx_L0;

Pythran#

We could get a speed improvement of a factor of almost 100 over the numpy code using cython. But the code became significantly more complex. Let us look at Pythran now to see what sort of improvements we get with it.

import pythran
%load_ext pythran.magic

Starting point#

We again start from the numpy function. Pythran works a bit differently to Cython. Instead of adding a type annotations to the code we simply use a comment to indicate the types for the parameters.

%%pythran
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
def make_decision_pt(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out

Pythran also allows to add multiple annotations for passing different types

%%pythran
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
#pythran export make_decision_pt(complex64[], complex64[])
def make_decision_pt(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out
%timeit make_decision_pt(qam64[0], qam64.coded_symbols)
2.41 ms ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
make_decision_pt(qam64[0].astype(np.complex64), qam64.coded_symbols.astype(np.complex64))
array([-0.15430336+0.77151674j,  0.46291006+1.0801234j ,
        0.77151674+0.77151674j, ..., -0.77151674+0.15430336j,
       -0.77151674-0.46291006j, -0.46291006-1.0801234j ], dtype=complex64)

Again not much gain over numpy.

However We did not use any compiler optimizations

Try again but use compiler optimizations

%%pythran -O3 -march=native -ffast-math -mfpmath=sse -fopenmp
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
def make_decision_pt(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out
%timeit make_decision_pt(qam64[0], qam64.coded_symbols)
2.3 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Thats another factor of 10 over the numpy code. Pythran can even use SIMD instructions. Lets see how much difference it makes here

%%pythran -O3 -march=native -ffast-math -mfpmath=sse -fopenmp -DUSE_XSIMD
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
def make_decision_pt(signal, alphabet):
    out = np.zeros_like(signal)
    d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
    idx = np.argmin(d, axis=0)
    out = alphabet[idx]
    return out
%timeit make_decision_pt(qam64[0], qam64.coded_symbols)
2.28 ms ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let’s see if we can get some more gains by splitting up the loop like in cython. We can also use openmp (multi-threading) support.

%%pythran -O3 -march=native -ffast-math -mfpmath=sse -fopenmp
import numpy as np
#pythran export make_decision_pt2(complex128[], complex128[])

def make_decision_pt2(signal, alphabet):
    out = np.zeros_like(signal)
    #omp parallel for
    for i in range(signal.size): # we only do 1D here
        dmin = 1000
        idx = 0
        for j in range(alphabet.size):
            d = abs(signal[i]-alphabet[j])
            if d < dmin:
                idx = j
                dmin = d
        out[i] = alphabet[idx]
    return out
%timeit make_decision_pt2(qam64[0], qam64.coded_symbols)
298 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Conclusions#

Both cython and pythran can yield significant speed improvements over python or even numpy code. The advantage of cython is that it is very generic and can speed up almost any code. Pythran on the other hand is not as general, but geared numerical applications, for this sort of code it can often yield impressive speed gains with very little adjustment to the original code (often only a comment is required).

Other packages#

Another very popular package for speeding up python code is Numba, which is a just-in-time (JIT) compiler. Numba can also get high speed gains, often very